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Abstract 

Background: Large-scale comparison of genomic sequences requires reliable tools for the search of local 
alignments. Practical local aligners are in general fast, but heuristic, and hence sometimes miss significant matches. 

Results: We present here the local pairwise aligner STELLAR that has full sensitivity for e-alignments, i.e. guarantees 
to report all local alignments of a given minimal length and maximal error rate. The aligner is composed of two 
steps, filtering and verification. We apply the SWIFT algorithm for lossless filtering, and have developed a new 
verification strategy that we prove to be exact. Our results on simulated and real genomic data confirm and 
quantify the conjecture that heuristic tools like BLAST or BLAT miss a large percentage of significant local 
alignments. 

Conclusions: STELLAR is very practical and fast on very long sequences which makes it a suitable new tool for 
finding local alignments between genomic sequences under the edit distance model. Binaries are freely available 
for Linux, Windows, and Mac OS X at http://www.seqan.de/projects/stellar. The source code is freely distributed 
with the SeqAn C++ library version 1.3 and later at http://www.seqan.de. 



Introduction 

Computing good local alignments is a fundamental pro- 
blem in bioinformatics. By looking for local alignments 
of biological sequences one aims for example at identify- 
ing homologous regions, i.e. regions that are assumed to 
originate from the same ancestral sequence, or at find- 
ing functionally similar sequences. The problem has 
been studied for more than 30 years [1,2], but still 
remains interesting. In the beginning, local alignments 
were used to look for homologous regions in relatively 
short protein or nucleic acid sequences. Also, for a long 
time, local alignments have been used to identify con- 
served, functionally related elements. More recently, 
local alignments were applied on a genomic scale as 
prerequisite to global genomic alignments. For several 
reasons genomic scale alignments are usually not colli- 
near and hence one has to resort to computing local 
similarities. Now the aim is not anymore to identify 
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some homologous regions but rather to display all simi- 
larities between two or more genomic sequences [3-7]. 
This requires not only efficiency in computation to pro- 
cess very long sequences, but also accuracy regarding 
sensitivity, i.e. exact tools that do not miss regions of 
significant local similarity. 

For the computation of local alignments numerous 
tools have been developed: Early tools such as 
SSEARCH [2] and FASTA [16] are sensitive but too 
slow for large-scale sequence comparison. Then, there 
are efficient heuristics, with the BLAST family [17-19] 
being the most prominent example. Further develop- 
ments for specific large-scale analyzes resulted in tools 
like BLAT [20] which was designed for high speed, and 
BLASTZ [21] which was designed for higher sensitivity. 
The more recently published tool BWT-SW [22] again 
focuses on sensitivity and is able to report all local 
alignments. 

To assess homology of biological sequences by local 
alignment, generally some kind of similarity criterion is 
necessary. The most widely accepted criterion is the E- 
value[18,23], a probabilistic measure that assesses the 
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significance of a local alignment. The E-value denotes 
the expected number of local alignments with a minimal 
score occurring by chance in the input sequences. The 
score underlying the E-value is a Smith- Waterman-like 
score, most commonly using affine gap costs. All popu- 
lar alignment tools that report E-values, e.g. BLAST, 
compute such a score, and afterward apply an E-value 
threshold on their output. In this paper, we describe a 
method that follows an alternative approach to compute 
only significant local alignments of nucleic acid 
sequences: We compute only high-scoring alignments, 
which are guaranteed to have a good (low) E-value. We 
use a maximal error rate for local alignments (normal- 
ized edit distance) as a score threshold, and additionally 
require a minimal alignment length. Our method is spe- 
cialized on relatively low error rates, which in turn justi- 
fies the use of edit distance instead of affine gap costs. 
For a given error rate and a minimal alignment length 
our method is exact, i.e. it identifies all local alignments 
without loss of sensitivity. We compute an E-value for 
all generated local alignments and a minimal E-value for 
the input parameters. The method is implemented in 
the program STELLAR (SwifT Exact LocaL AligneR) 
using the SeqAn C++ library [24,25]. The program 
depends only on very few and clearly understandable 
parameters. We prove that our algorithm is exact for all 
reasonable parameter settings and confirm this experi- 
mentally. We compare STELLAR against popular local 
alignment programs, namely BLAST, LASTZ, BLAT, 
and BWT-SW in terms of speed and sensitivity and 
show that some of the tools miss many significant local 
alignments that can be detected with STELLAR. 

Methods 

Definitions and overview of algorithm 

A pairwise local alignment of length n is a sequence of 
n match, insertion, deletion, and substitution columns. 
In our approach deletion, insertion, and substitution col- 
umns are all treated equally. Hence, we will call these 
columns error columns. The number of error columns 
is the edit or Levenshtein distance of a local alignment. 
Normalizing this distance by dividing it by the length of 



the local alignment, we obtain an error rate. An e-match 
is a local alignment that has an error rate of at most e > 
0 and length n > n 0 . Fig. 1 shows two examples of 
e-matches as segments of a longer local alignment. 

The notion of an X - drop to delineate local align- 
ments from each other is well established by Miller and 
coworkers in the context of similarity alignments 
[18,26,27]. An X-drop within an alignment, where X > 0 
is given, is a region of consecutive columns with a total 
score of -X or less. In other words, it is a region where 
the score drops by X or more. In Fig. 2 we display an 
example in which we score error columns by -1. The 
X-drop is a very intuitive way to model local dissimilar- 
ity and hence we choose to adopt the concept for our 
model of local similarity. Since we address the computa- 
tion of e-matches in this paper we propose the following 
scoring scheme that depends on the error rate e: We 
score a match by +1 and an error by p = -lis + 1. In 
addition, we adjust the score drop-off parameter by 
multiplying it with the negative of the error penalty -p 
such that the user specified parameter X still corre- 
sponds to the number of errors in a match-free X-drop 
region and is easy to grasp for the user (Fig. 3). To 
emphasize the difference from the usual understanding 
of an X-drop we call this weighted X-drop an s-X-drop. 

The goal of this work is to find e-matches of two 
sequences without e-X-drops. Often, e-matches overlap 
to a large extent (see Fig. 1 for an example), or segments 
of e-matches are themselves e-matches (e.g. we obtain an 
e-match by removing a column from one end of the 
e-matches in Fig. 1). Thus, the number of e- matches can 
be very large with very redundant similarity information. 
We handle this issue by only identifying the longest 
e-match of one location: If two e-matches overlap, we out- 
put only the longer one unless the overhanging part of the 
shorter e-match still has a minimal length of n 0 . In that 
case we output both complete e-matches. Thus, for the 
example in Fig. 1 we only output one e-match. We say 
that such an e-match is maximal 

We now present an algorithm to compute exactly all 
maximal e-matches without e-X-drop. Our algorithm 
runs in two phases: filtering and verification. The 
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Figure 1 Overlapping e-matches. An alignment of two strings containing two overlapping e-matches for e = 0.1 and n 0 = 20. The e-match 
indicated by the dashed box has an error rate of 2/22, the e-match indicated by the box with a continuous line one of 2/25. The union of these 
two e-matches from position 4 to position 31 is not an e-match: the error rate is 3/28 > 0.1. Furthermore, the intersection of the two e-matches 
is with 19 columns too short to be an e-match. 
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TGCCTGGACTATTGATCCTGGCTCAGGATGA 


CGCTAGG- 


ACGC 


TGCCTGGACTATTCCGCCTGGCTCAGGATGA 


GG-TACAT 



X-drop 

Figure 2 X-drop. An alignment containing an X-drop for X = 3. In this example, an e-match with s = 0.1 and n 0 = 20 (indicated by the box) 
spans the X-drop. 



filtering phase implements the SWIFT algorithm [28], a 
very efficient full-sensitivity filter for ^-matches. Note 
that SWIFT is a filter algorithm and does not output a 
list of ^-matches. While it guarantees not to miss any 
maximal £-match and hugely reduces the search space, a 
verification phase is necessary to identify false positive 
hits of the filter. Furthermore, verification is needed to 
determine the exact start and end positions of maximal 
^-matches. We have developed a verification strategy 
that runs in five steps: £-core identification, £-X-drop 
core filter, £-X-drop extension, identification of maximal 
^-matches, and filtering of overlapping matches. Verifi- 
cation may stop after any of these steps if it is clear that 
we will not identify a new e-match without e-X-drop. 
The strategy guarantees to find all maximal ^-matches 
without £-X-drop. 

A similar two-step algorithm that consists of SWIFT 
filtering and subsequent verification is implemented in 
the read mapper RazerS [14]. The difference to RazerS 
is, however, that we are looking for ^-matches that are 
local in both sequences whereas read mappers compute 
semi-global alignments, i.e align the full read sequence 
to a reference. RazerS uses a slightly modified SWIFT 
filtering, and the verification is much simpler since the 
length of the final alignments is preset by the read 
length. 

Filtering phase 

The SWIFT algorithm proposed by Rasmussen et al. is 
an efficient #-gram based filter to detect potential s- 
match regions between two sequences. It is based on 



the #-gram lemma [29,30]. This lemma states that every 
alignment of length n with k error columns contains at 
least T(n, k, q) := n + 1 - q(k + 1) #-hits, substrings of q 
consecutive match columns. Considering the dotplot of 
two sequences, every #-hit corresponds to a diagonal 
stretch of matches with length q. Obviously, all #-hits of 
an alignment with k errors can cover at most k + 1 dif- 
ferent diagonals in the dotplot. 

Rasmussen et al. proved that for any given s and n 0 
there exist w, q, e and r such that every e-match con- 
tains r ^-hits that reside in a w x e parallelogram. A w x 
e parallelogram is the intersection of e + 1 consecutive 
diagonals and w + 1 consecutive columns in the dotplot. 

To detect w x e parallelograms with r #-hits in the 
dotplot, the SWIFT algorithm slides from left to right 
over one sequence and searches overlapping ^-grants 
in a #-gram index of the other sequence. Found #-hits 
are counted in bins of A + e consecutive diagonals 
whose first diagonal is a multiple of A. As adjacent 
bins share e diagonals, every w x e parallelogram is 
contained in one bin. Every bin contains a #-hit coun- 
ter and represents the parallelogram with columns 
bounded by the leftmost and rightmost contained 
#-hit. If a q-hit is found that is at most w - q columns 
apart from the rightmost #-hit, the parallelogram is 
extended. Otherwise it is closed and a new one starting 
at the found #-hit is opened as the two hits cannot be 
part of the same w x e parallelogram. A closed paralle- 
logram whose bin counter has reached r is output as a 
SWIFT hit and verified as described in the following 
section. 
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Figure 3 e-X-drop. An alignment containing an e-X-drop for X = 3 and s = 0.1. To reach an e-X-drop with a score drop-off of at least 
3-(yj-lj = 27, a fourth error column is necessary in this example because of the positively scoring matches in between the errors. 
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Fig. 4 shows examples for SWIFT hits containing 
either a subalignment of an £-match, whole ^-matches, 
no £-match or an £-match with an X-drop. 

Verification phase 

Fig. 4 demonstrates that the output of the filtering phase 
is not yet a list of ^-matches, although the SWIFT algo- 
rithm hugely reduces the search space. SWIFT hits may 
contain one or more ^-matches, but may as well be false 
positive and contain no £-match. Some SWIFT hits may 
overlap and contain the same or parts of the same s- 
match. Further, they may be much longer than a con- 
tained £-match, or they may cover only parts of an s- 
match. Therefore, we have developed the following veri- 
fication strategy. 

We start verifying SWIFT hits by identifying a seg- 
ment of an £-match that overlaps with the SWIFT hit. 
We call such a segment an s-core. We guarantee not to 
miss any e-match by identifying all e-cores contained in 
a SWIFT hit. e-cores will then serve as starting points 



for extension, possibly beyond the ends of SWIFT hits. 
Finally, we cut the longest s-matches from extended 
£-cores and remove overlapping ^-matches. 
Definition and existence of s-cores 

Under the simple scoring scheme where a match scores 
+1 and an error p = -lis + 1, we define an e-core of an 
e-match as a segment with a score of at least: 



fe{o,i} 



[srii J + l 



where n 0 is the minimal length of an £-match and 

n i ~ I" ( L £n o J + 1 ) / 8 1 * s tne next l ar g er length that 
allows one more error than n 0 . 

In the following two lemmata we prove the correct- 
ness of our approach that starts verification from 
£-cores. 

Lemma 1. Every s-match contains at least one s-core. 
Proof. [ sn J is the maximal number of errors in an s- 
match of length n. Thus, the number of matching 
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Figure 4 Example SWIFT hits. Example SWIFT hits for n 0 = 20, s = 0.1, and q = 6 (= s min ). Accordingly, w = 20, r = 3, and e = 2. SWIFT 
searches parallelograms that contain at least r = 3 g-gram hits by streaming over sequence 1 and searching common g-grams in sequence 2. 
Subfigure (a) shows an e-match that results in two SWIFT hits and the e-match is longer than both of the two hits, (b) shows a SWIFT hit that 
contains two e-matches and (c) shows a false positive SWIFT hit induced by three separated g-gram hits, (d) shows a SWIFT hit that contains an 
e-match with a 3-drop. 
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minimal e-match length n 0 



Figure 5 Sawtoothed function. Plot of the sawtoothed function l(n, s) = 



n - |_en J 
[enj+l 



for n 0 and n 1 = 



[gn 0 J + l 



with s = 0.05. 



positions in an e- match of length n is at least n - [ sn J . 
These n - [ sn J matching positions can be split by the 

errors of the e-match into at most [ sn J + 1 error-free 

segments. If the errors are spread evenly over the e- 
match, at least one of the error-free segments has length 

n — I sn I 

> l(n, e), where l(n r s):=-. —. — -. Some thought 

|_en J + 1 

reveals that any other distribution of the errors would 
result in at least one longer error-free segment. There- 
fore, an e-match of length n contains at least one e-core 
of length > l(n, s). Unfortunately, l(n, s) is a sawtooth 
function, i.e. l(n, s) is not monotonically increasing in n 
(Fig. 5). Hence, one cannot use l(n 0 , e) as a bound for 
all n > n 0 . The function drops to a minimum at each 

point [\ i / s ]} ieN ♦ However, it is easy to confirm that 

the minima are strictly increasing, i.e. each successive 
minimum of the sawtooth is higher than the previous one. 
Therefore, the smallest value of l(n, s) over all n > n 0 is the 
minimum of l(n 0 , s) and l(n x , s). This is exactly the defini- 
tion of 5 min , i.e. an e-match contains at least one error-free 
segment of length s min which is an e-core. D 

Lemma 1 settles the existence of an e-core for every 
e-match. Unfortunately, the SWIFT filter only guaran- 
tees to report SWIFT hits that overlap a part of each 
e-match. Hence, in principle, the SWIFT hit could not 



contain an e-core, which in turn could make our 
algorithm miss the e-match. In the next lemma we will 
show that for a certain value of the parameter q this is 
never the case. 

Lemma 2. The intersection of a SWIFT hit with an 
s-match contains at least one s-core if q := s mm . 

Proof By definition, the intersection of a SWIFT hit 
with an e-match contains at least r ^-grants interspersed 
by at most e errors. Therefore, every SWIFT hit 
contains at least one segment of at least |~r /(e + l)] 
consecutive ^-grants. The length of this segment is at 
least |~t + + Because r > 0 and e > 0 the 

first summand f + is greater or equal one, 
so we obtain [~r /(£ + l)] + ^- l>^. Thus, if we set 
q : = s mm every SWIFT hit contains at least one segment 
of length s mm , which is an e-core. D 
Step 1: E-core identification 

In our verification strategy, we identify e-cores by apply- 
ing a banded version of the Waterman-Eggert local 
alignment algorithm [31]. The original algorithm com- 
putes all non-overlapping local alignments that reach a 
specified minimal score under a certain scoring scheme 
by dynamic programming (DP). We use the scoring 
scheme that scores matches by +1 and errors by p = - 
lis + 1 and set the minimal score to s min . In our ver- 
sion, we reduce running time and space requirements of 
the algorithm by banding the computation of the DP 
matrix according to the parallelogram shape of SWIFT 
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hits (see Fig. 4, only the shaded parts of the alignment 
matrix need to be computed). Thus, per e-core there is 
a maximum running time of 0(w'e) where w' is the 
length and e the width of the corresponding SWIFT 
hit. 

Since the Waterman-Eggert algorithm reports only 
non-overlapping local alignments one may think that 
some e-cores will not be identified because they are 
hidden by longer local alignments that reach beyond 
the end of an e-match. However, this can only be for 
non-maximal e-matches since we have chosen the 
scoring parameters such that e-cores extended by 
additional local alignment columns only have higher 
scores if the additional columns have themselves an 
error rate of at most e (Fig. 6). This is why all maximal 
e-matches will also include the additional columns, i. 
e. no local alignment will reach beyond the end of an 
e-match. 

Step 2: E-X-drop core filter 

The second step of our verification strategy is a filter for 
e-X-drops in the e-cores. In the previous step we 
ignored e-X-drops in e-cores. If now one of our e-cores 
contains an e-X-drop, the e-core should be divided into 
two cores in order to remove the e-X-drop. Similarly, if 
an e-core contains more than one e-X-drop, the e-core 
should be divided into even more cores. 

For this decomposition of the e-cores, we apply the 
post-processing algorithm from Zhang et al. [27] with 
the same scores and penalties for matching and error 
positions as in our e-core identification step. The worst- 
case running time of this algorithm is linear in the 
length of the e-core. 

Possibly, we obtain more than one e-core after this 
step, but in any case the following extension step has to 
be conducted only to the left and right of the original 
non-decomposed e-core. If we started with an e-core 
with more than one e-X-drop, we can skip the following 
extension step for the middle parts, since the extension 



algorithm would run immediately into the previously 
detected e-X-drops. 
Step 3: E-X-drop extension 

The goal of the extension step is to obtain a region that 
spans all e-matches without e-X-drop containing the 
e-core. In this region, the extended e-core, we can then 
look for the maximal e-match. Clearly, we can discard 
extended e-cores that are shorter than n 0 . 

For extension we apply the gapped extension algo- 
rithm by Zhang et al. [19] with the e-adjusted scoring 
parameters as before. This algorithm is a score-only 
algorithm, i.e. it reports only the score and the sequence 
positions of the maximal extension but not the precise 
alignment. However, it reports the maximal and mini- 
mal diagonal of the alignment matrix (a band) that 
needs to be computed when looking for the precise 
alignment. We will determine the alignment in the next 
step of the verification strategy together with the exact 
begin and end positions of the maximal e-match. 

It is hard to do an informative running time analysis 
for this step. In theory, the dynamic programming algo- 
rithm could fill big parts of the alignment matrix. How- 
ever, for very similar sequences only a narrow diagonal 
stretch will be filled, and for very distinct sequences we 
will soon reach an X-drop and stop. Still, we can esti- 
mate the running time by O (bL) where L is the length 
of the extension and b is the width of the band. 

It is easy to confirm that if an e-core is part of an e- 
match without e-X-drop, then the extended e-core spans 
this e-match without e-X-drop. 
Step 4: identification of maximal E-matches 
The remaining task is to determine the longest align- 
ment in the extended e-core that has an error rate of at 
most e. More precisely, we are looking for the longest 
extension to the left and to the right of the e-core such 
that the complete alignment has an error rate of at 
most e. The maximal error rate (i.e. the number of 
errors and length) that we can allow for the extension 



10 



15 



20 



25 



30 



TC-TjTGACTGGACTGTTCATCCTGGCiTCAGGATGA 



ACGCTGCCTGGACTATTCATCCTGGCi-CAGGACGA 



■ 1 match columns, is still an e-match. Therefore, an 



£-core 

Figure 6 Extended e-core. An e-match extended by one error column and — p ■ 
e-core should include such an extension, since all maximal e-matches at this location will include it. In this example where e = 0.1 and n 0 = 20 
we see an error-free segment of length 7 > 6 = s min that can be extended to an e-core of length 19 including one error. 
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to the left depends not only on the error rate of the s- 
core but also on the error rate of the extension to the 
right, and vice versa. Therefore, we cannot determine 
the lengths of the right and left extension separately. 
Furthermore, depending on the length of the extension 
the optimal trace through the alignment matrix can dif- 
fer (Fig. 7). Our suggestion is to compute for all possible 
extension lengths the optimal end position of a trace in 
the alignment matrix, then to determine the optimal 
lengths of the right and left extension, and lastly to 
carry out the traceback. The details of these three steps 
are described in the following. 

The computation of the optimal end position of a 
trace for all possible extension lengths can be done 
along with the computation of the alignment matrix. 
Unfortunately, two traces of different lengths may end 
in the same column of the alignment matrix. For this 
reason, it is necessary to compute the alignment matrix 
using an algorithm for normalized alignment score 
[32,33], which in addition iterates over all alignment 
lengths. As already mentioned, the alignment matrix can 
be banded by the minimal and maximal diagonal from 
the seed extension algorithm. 

Along with the alignment matrix computation we 
check and if necessary update in each iteration step a 
list bestEnds with the best alignment matrix entry for 
the corresponding extension length. Each entry consists 
of the length and score (or number of errors) of the 
alignment, and coordinates in the alignment matrix. 
This list can afterward be reduced to a smaller set of 
possible lengths using the following observation: The 
position before the start and the position behind the 
end of the sought e-match is an error, otherwise the s- 
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Figure 7 Extension traces of an £-core. Alignment matrix with 
two possible traces of the extension of an e-core. Depending on 
the length of the extension (red numbers), we obtain the lowest 
error rates by aligning different sequence positions to each other. 
For an extension length of 5, 7, and 9 it is better to follow the 
upper trace, whereas for a length of 1 1 the lower trace has fewer 
errors. For all other lengths there is a longer trace with the same 
number of errors, and therefore it is not necessary to consider those 
lengths while looking for maximal e-matches. 



match would not be maximal. Hence, we only need to 
keep those list entries for lengths / where the score dif- 
ference of bestEnd[l] and bestEnd[l + 1] is smaller than 
the score of a match. As a result we obtain two lists 
with possible traceback starting points, one for start 
positions of the £-match obtained from the left exten- 
sion, and one for end positions obtained from the right 
extension. 

On these lists we then apply the following exhaustive 
search algorithm that iterates over combinations of pos- 
sible start and end positions: We start with the leftmost 
possible start position and iterate over possible end 
positions from right to left. We continue with the next 
possible start position and restart with the rightmost 
possible end position as soon as the segment between 
the current start and end position has an error rate of 
at most s (update currently longest e-match), or if this 
segment is shorter than the minimal e-match length or 
our currently longest e-match (do not update currently 
longest £-match). The algorithm stops when the seg- 
ment between the current start position and the right- 
most possible end position is shorter than the minimal 
£-match length or the currently longest e-match. Using 
this strategy we cannot miss the longest e-match with- 
out e-X-drop if the £-core is part of one. 

In case there is another maximal e-match containing 
the £-core, we have to recurse this search twice with the 
lists reduced by the following entries: All start positions 
before the start position of the longest e-match and all 
end positions that are smaller than n 0 added to the end 
position of the longest e-match; and all start positions 
before the start position of the longest e-match minus 
n 0 and all end positions behind the end position of the 
longest £-match. 

As a last step, we have to look up the coordinates for 
the optimal extensions in the bestEnd lists and start tra- 
ceback from these positions in the alignment matrices. 
The £-core extended by the resulting alignments is a 
maximal £-match that contains the £-core. 

The running time for computing the alignment matrix 
using a normalized alignment score is in Q{bL 2 ) where 
b is again the width of the band and L is the length of 
the extension. Dropping the normalization of the align- 
ment score reduces running time by a factor of L. 
Determining the optimal start end end position in the 
left extension of length L x and right extension of length 
L 2 takes 0(L 1 L 2 ) time per e-match, and finally, the tra- 
ceback takes time linear in the length of the final s- 
match. 

Step 5: removal of largely overlapping s-matches 

An £-match identified during the verification phase is 
the longest that contains one specific e-core but it is not 
necessarily maximal in that it does not overlap with a 
longer e-match. In addition, an e-match containing two 
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s-cores will be identified twice. To ensure that we out- 
put each e-match only once and also only maximal s- 
matches, this last step is necessary. 

We remove overlapping ^-matches by sorting the s- 
matches by their begin position in one sequence and 
pairwisely comparing here overlapping matches further. 
If two ^-matches are found to be identical, one is dis- 
carded. Also, if the shorter of the two ^-matches has no 
unique part of length n 0 , this e-match is discarded. The 
running time of this last step is dominated by sorting 
the ^-matches, i.e. it is in O (M log M) , where M is the 
number of ^-matches before removal. 

Theorem.Let M be the set of maximal e-matches with- 
out s-X-drop between two sequences. Then the algorithm 
that uses SWIFT for filtering and the described strategy 
for verification will detect exactly all s-matches in M. 

Proof The SWIFT filter algorithm guarantees to 
report at least one overlapping SWIFT hit for every s- 
match of the input sequences. The first step of the veri- 
fication strategy detects all e-cores in SWIFT hits. Apply 
Lemma 1 to prove that every e-match contains an s- 
core. According to Lemma 2, one of the e-cores of every 
£-match is contained in a SWIFT hit. Thus, for every s- 
match in M the first verification step identifies at least 
one e-core. 

Let C be the set of e-cores identified during the first 
step, and let C <= C be the subset of e-cores that are 
part of an e-match in M. Since none of the ^-matches in 
M contain an e-X-drop, the local alignment obtained 
after c-X-drop extension (Step 3) of an e-core c s C 
spans the corresponding e-match. 

By cutting the extended e-cores as described in Step 4, 
we eventually end up with a set of ^-matches M f that 
each contains a certain £-core. Step 4 also guarantees 
that per e-core no longer e-match exists than the s- 
match in M\ Therefore, after removal of overlapping s- 
matches (Step 5), our set of ^-matches contains exactly 
all maximal ^-matches without e-X-drop. D 

Results and discussion 

We have implemented the algorithmic pipeline in the 
program STELLAR following exactly the above 
described steps with one exception: To improve running 
time, STELLAR computes the alignment matrix during 
the identification of the longest e-match with unnorma- 
lized alignment score. The following results show that 
this has in practice no effect on the sensitivity. 

We tested STELLAR on simulated and on real geno- 
mic data and compared its performance to BLAST [18], 
LASTZ as the replacement of BLASTZ [21], BLAT [20], 
BWT-SW [22], and Smith-Waterman alignments 
obtained with SSEARCH from the FASTA package [16]. 
In addition, we ran BLAST with a more sensitive para- 
meter setting: According to Lemma 1, every e-match 



contains a seed of length s min ) and therefore, we set 
BLAST'S word-size parameter to the corresponding s mm 
computed in STELLAR. To demonstrate the differences 
between the programs we compared speed and sensitiv- 
ity on all data sets. Running times were measured on a 
2.66 GHz Intel Xeon X5550 with 72 GB of RAM run- 
ning Linux. Running times of BWT-SW include pre- 
processing of the database sequence. In all test runs 
STELLAR needed less than 1 GB of RAM, so we omit 
further details of memory usage. As a measurement for 
sensitivity we computed the percentage of matches from 
a reference set that were sufficiently covered by matches 
from the respective program. We say that matches that 
are covered by less than 10% are missed (Fig. 8). This is 
a very loose criterion which is in favor of the compared 
programs. 

Simulated sequences 

To demonstrate STELLAR's gain of sensitivity in com- 
parison to other programs, we used simulated data sets. 
The advantage here is that the reference set of local 
alignments for the computation of the sensitivity is 
given. We simulated random sequences with uniformly 
distributed characters from the alphabet {A, C, G, T}. In 
addition, we simulated random local alignments of 
length 50-200 bp and inserted errors at different rates 
into the alignments. In order to see at what error rates 
the programs start missing local alignments, we created 
a first data set where 500 such local alignments with an 
error rate of 0%, 2.5%, 5%, 7.5%, or 10% were inserted 
into a sequence of length 1 Mb at random positions. A 
second simulation was conducted to assess the effect of 
the sequence length. Here, sequences of lengths 1 kb, 10 
kb, 100 kb, 1 Mb, and 10 Mb were simulated containing 




Bi B2 B3 B4 

•I* 4' 4" 

partially fully almost fully almost fully 

covered covered covered covered 

A 6 A 7 A 8 



seq 1 

seq 2 

B 6 B 7 
4, 4, 
not covered not covered 

Figure 8 Match coverage. A set of matches A = A 8 } that 

are to be compared to a set of reference matches {B h B 7 }. We 
say that a match B-, is covered by the matches from A if at least 
10% of the alignment columns agree between B, and any match 
from A. 

K J 
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Table 1 Running times and sensitivity on simulated sequences containing local alignments of different error rates 



error rate 


0% 

time 


missed 


2.5% 

time 


> 

missed 


5% 

time 


missed 


7.5% 

time 


> 

missed 


10% 

time 


missed 


SSEARCH 


133:17 h 


0.00% 


132:55 h 


0.00% 


133:24 h 


0.00% 


132:25 h 


0.00% 


132:52 h 


0.00% 


STELLAR 


2.96 s 


0.00% 


3.30 s 


0.00% 


3.61 s 


0.00% 


3.91 s 


0.00% 


4.44 s 


0.00% 


BWT-SW 


16.59 s 


0.00% 


16.63 s 


0.00% 


16.53 s 


0.00% 


16.29 s 


0.00% 


16.24 s 


0.00% 


BLAST 


0.25 s 


0.00% 


0.26 s 


0.16% 


0.26 s 


5.36% 


0.25 s 


1 7.00% 


0.24 s 


38.80% 


BLAST* 


0.25 s 


0.16% 


0.26 s 


0.00% 


0.28 s 


0.04% 


0.51 s 


0.28% 


14.99 s 


2.60% 


LASTZ 


6.49 s 


0.00% 


6.48 s 


0.72% 


6.22 s 


5.56% 


5.86 s 


12.68% 


5.23 s 


24.92% 


BLAT 


14.30 s 


29.36% 


11.52 s 


29.64% 


14.06 s 


28.88% 


14.71 s 


31.44% 


14.66 s 


34.32% 



Sequences have a length of 1 Mb and contain local alignments of lengths between 50 and 200 bp. The simulations were repeated five times, the displayed 
values are the average of all runs except for SSEARCH which was run only once. Sensitivity is measured by the percentage of missed local alignments (Fig. 8). 
BWT-SW, BLAST, BLAT, and LASTZ were run with default parameter settings. BLAST* stands for a more sensitive run of BLAST with the word size parameter set to 
s mm , i.e. the minimal length of an c-core (see text for details). 



local alignments with error rates between 0 and 10%. 
On these data sets we ran the above mentioned pro- 
grams, with STELLAR's error rate parameter set accord- 
ingly and the minimal match length set to 50. The 
results are shown in Tables 1 and 2. Table 1 demon- 
strates that STELLAR, in particular for the higher error 
rates, outperforms the other programs. SSEARCH has 
full sensitivity as expected but is very slow. We con- 
firmed that with an E-value cutoff of 0.01 it does not 
detect any other than the inserted local alignments. 
BWT-SW reports all local alignments, too, but is still 
much slower than STELLAR. For BLAST and LASTZ 
the number of missed matches is low for very low error 
rates but increases with higher error rates. This implies 
that one can benefit the most from STELLAR when 
comparing closely related sequences that still have sig- 
nificant differences. As an example, Fig. 9 displays one 
s-match that only STELLAR, BWT-SW, and SSEARCH 
identify. BLAST is the fastest of all programs, though 
only with default parameters and lower sensitivity. 
BLAT constantly misses around 30% of all matches. We 
assume that the reason for BLAT's bad performance is 
that it was originally designed for the comparison of 
many short sequences (ESTs or reads) against one long 



sequence and not for the comparison of two long 
sequences. Table 2 supports this assumption, as the 
number of matches missed by BLAT is low for the 1 kb 

- 100 kb sequences but increases up to almost 70% for 
sequences of length 10 Mb. In contrast, the sensitivity 
of BLAST and LASTZ seems not to be affected by dif- 
ferent sequence lengths. STELLAR is in general faster 
than BWT-SW, BLAT, and LASTZ with one exception 

- the 10 Mb sequences. This indicates already a limita- 
tion of STELLAR on very long sequences with high 
error rates. The SWIFT filter has a lower specificity for 
high error rates and generates very many SWIFT hits. 
As a result, many verification steps are necessary, which 
leads to an increase in running time. However, STEL- 
LAR is faster than the sensitive BLAST for the 10 Mb 
sequences and also for high error rates. The sudden 
increase in running time for the sensitive BLAST at 
higher error rates (Table 1) is due to a much lower s min 
for s = 10%. 

Genomic sequences 

We downloaded the assembled genomes of Drosophtta 
melanogaster (release 5.26) and Drosophtta pseudoobs- 
cura (release 2.14) from FlyBase [34]. We selected 



Table 2 Running times and sensitivity on simulated sequences of different lengths 



seq length 


1 


kb 


10 kb 




100 kb 


1 Mb 




10 Mb 






time 


missed 


time 


missed 


time 


missed 


time 


missed 


time 


missed 


SSEARCH 


2.45 s 


0.00% 


259.5 s 


0.00% 


7:16 h 


0.00% 


136:16 h 


0.00% 






STELLAR 


1 ms 


0.00% 


5 ms 


0.00% 


0.07 s 


0.00% 


4.36 s 


0.00% 


782.46 s 


0.00% 


BWT-SW 






787 ms 


0.00% 


1.40 s 


0.00% 


16.33 s 


0.00% 


508.45 s 


0.00% 


BLAST 


4 ms 


14.00% 


8 ms 


6.00% 


0.03 s 


11.40% 


0.25 s 


13.40% 


3.34 s 


12.64% 


BLAST* 


6 ms 


0.00% 


13 ms 


0.00% 


0.16 s 


0.00% 


14.75 s 


0.60% 


2266.64 s 


1 .46% 


LASTZ 


9 ms 


10.00% 


59 ms 


2.00% 


0.56 s 


7.80% 


5.99 s 


9.40% 


116.26 s 


9.12% 


BLAT 


25 ms 


0.00% 


39 ms 


0.00% 


0.40 s 


1 .00% 


15.24 s 


33.00% 


384.79 s 


69.30% 



Sequences contain a = [length/2000] local alignments with a maximal error rate of 10% and lengths between 50 and 200 bp. The simulation of the 1 kb 
sequences was repeated 50 times, the simulation of the 10 kb and 100 kb sequences ten times. The displayed values are the average of all runs except for 
SSEARCH which was run only once. Sensitivity is measured by the percentage of missed local alignments (Fig. 8). BWT-SW, BLAST, BLAT, and LASTZ were run 
with default parameter settings. BLAST* stands for a more sensitive run of BLAST with the word size parameter set to s mm , i.e. the minimal length of an c-core 
(see text for details). 
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8 6 63 6 9 GGGCAGGGAACGCCACATG-GCACATCAAGGCGCATATCAGAGGGACCCA 
I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 
5057 6 9 GGGC-GGGAACGCCACATGAGCACATCAA-GCGCATATCAGA-GGACCCA 



8 6 6419 CCAAGGGGGACGGAGCTTCCGAACGCTGAGTTCGAACTCCCGGAGC 
I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 
505819 CCAA-GGGGACGGAGC-TCC-AACGCTGAGTTCGAACT-CCGGAGC 

Figure 9 e-match in simulated data. An example e-match from the simulated sequences of length 1 Mb and s = 10%. This e-match with an E- 
value of 7 x 10" 26 is only found by STELLAR, BVVT-SW, and SSEARCH, but by none of the other programs. 



chromosome arm 2L from D. melanogaster (-23.5 Mb) 
and group 3 from the chromosome 4 assembly of D. 
pseudoobscura (-11.7 Mb) for our test runs. Unfortu- 
nately, this data set is too big to compute local align- 
ments with SSEARCH, and since BLAST performs 
better than BLAT and LASTZ on the simulated data 
we chose to compare STELLAR on the genomic data 
only to BLAST. We expect BLAST to compute very 
short alignments with low error rates and some long 
alignments with higher error rates that do not fulfill 
the minimal length or error rate criterion for s- 
matches, and hence STELLAR will not find them. 
Therefore, to double-check STELLAR's sensitivity, we 
filter all ^-matches from the set of BLAST hits. Addi- 
tonally, some of the longer BLAST hits may contain 
valid ^-matches that we extract and add to the set of 
filtered BLAST hits. 

Results for STELLAR are shown in Table 3 and results 
for BLAST in Table 4. STELLAR identifies for example 
345 ^-matches with an error rate of 10% and a minimal 
match length of 200. As expected, these cover all of the 
^-matches filtered from the set of BLAST hits. In con- 
trast to the simulated sequences, the running time 
increases significantly with a higher error rate or lower 
minimal length. This can be explained with the filter 
algorithm being more specific for higher minimal 
lengths and low error rates as already mentioned above. 
With less specific filtering, many more SWIFT hits need 
to be verified resulting in a higher running time. In the 
future we might be able to reduce this effect by 
parallelization. 



Table 3 Results of STELLAR on drosophila chromosomes 



error min. 
rate length 


running 
time 


num. of 
e-matches 


overlap 
BLAST 0 


10% 100 


7777 s 


3911 


100% 


10% 200 


1566 s 


345 


100% 


5% 200 


21 s 


44 


100% 



We used chromosome arm 2L from D. melanogaster (-23.5 Mb) and group 3 
of chromosome 4 from D. pseudoobscura (~1 1.7Mb). STELLAR was run with the 
X-drop parameter set to 20. 

a Percentage of covered local alignments from accordingly filtered BLAST 
output. 



BLAST with default settings is again the faster pro- 
gram, but misses 17 ^-matches of minimal length 200. 
One of these matches is displayed in Fig. 10. When we 
change the word size parameter such that BLAST is 
able to identify all s-matches of minimal length 200, it is 
slower than STELLAR. STELLAR with minimal length 
100 and error rate 10% is slowest in all tested settings, 
but identifies 408 ^-matches that BLAST with default 
parameters does not find, and 13 s-matches that even 
the more sensitive setting of BLAST does not find 
though these s-matches have an E-value of 6.1 x 10~ 23 
or lower. 

Conclusions 

We presented STELLAR, an algorithm to compute all 
local alignments of a minimal length according to a clear 
quality definition using the established measures error 
rate and X-drop. STELLAR brings exact local alignments 
to the community at the speed of heuristic state-of-the- 
art tools like BLAST, BLAT, or LASTZ. In addition, our 
experiments show that our effort is worthwhile since the 
heuristic tools miss up to about a third of the matches 
using simulated and real genomic data. Compared to its 
closest competitor, BWT-SW, it is in most benchmarks 
faster and offers with the X-drop parameter a possibility 
to exclude local alignments with bad regions. A limita- 
tion of STELLAR is that only s-matches up to a certain 
error rate can be computed since the filtering phase loses 
specificity with increasing error rate. Therefore, for 
longer and less similar though significant local align- 
ments BLAST remains more appropriate. 



Table 4 Results of BLAST on drosophila chromosomes 



word 
size 


running 
time 


num. of 
hits 


overlap STELLAR 
200 b 


overlap STELLAR 
100 b 


default 


9 s 


9504 


95.1% 


89.6% 


8 


2175 s 


29597 


100.0% 


99.7% 



We used chromosome arm 2L from D. melanogaster (-23.5 Mb) and group 3 
of chromosome 4 from D.pseudoobscura (~1 1.7Mb). 
a Percentage of covered STELLAR matches (error rate 10%, minimal length 
200 or 100). 
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17794720 TTTCAGTTGCAGCGAGGGTCGCATCATGTCCTCGGACACCAAGGGATGTGTGGACCGCAACGAGTGCCTGGACCTGCCAT 
I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 
10 60022 9 TTTCAGTTGCAGCGAGGGACGCATCATGTCGTCGGACACCAAGGGCTGTGTCGATCGCAACGAATGCCTGGATCTGCCCT 

17794700 GTCTGAACGGAGCCACCTGCATCAATCTGGAGCCCCGGCTGCGGTACCGATGCATTTGCCCGGAGGGCTACTGGGGCGAA 

I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 
10600309 GCCTGAACGGGGCCACGTGCATCAATCTGGAGCCCCGTCTGCGGTATCGCTGCATCTGCCCGGAGGGCTACTGGGGCGAG 

17794580 AACTGCGAGCTGGTGCAGGAGGGACAGCGCCTGAAGCTGAGCATGGGCGCCCTGGGGGCCATATTCGTTTGCCTGATTAT 
I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I 
10600389 AACTGCGAGCTGGTCCAGGAGGGCCAGCGCCTGAAGCTGAGCATGGGGGCCCTCGGCGCCATATTCGTTTGCCTGATTAT 

17794560 CATACTGAGTAAGTA-G-AGTGATGG 

10600469 CATATTGAGTAAGTACGAAATG-TGG 
Figure 10 £-match between drosophila sequences. An example e-match from the drosophila sequences with s = 10% and n 0 = 200. This s- 
match with an E-value of 6 x 10" 84 is not found by BLAST with default parameters. 



As an outlook another relatively new application for 
local alignments that has emerged with the advent of 
cheap next-generation sequencing should be mentioned. 
Standard read mapping programs [8-14] usually can 
only map entire reads to the reference. With the 
increasing read length, there will be more reads that 
span breakage points, e.g. translocations, gene fusions, 
or splice junctions. The application of an efficient and 
exact local alignment program could be one way to suc- 
cessfully map such reads and detect variation [15]. 
Application of STELLAR is especially promising in that 
it uses the error rate for sensitivity control, an estab- 
lished criterion for read mappers. With a downstream 
chaining procedure of the partial read matches detected 
by STELLAR, it may then be possible to detect even 
multiple splits of reads. Hence, for finding local align- 
ments in the tested range of error rates STELLAR could 
replace the heuristic tools. 
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